McKenzie County
Minimum Wasserstein distance estimator under covariate shift: closed-form, super-efficiency and irregularity
Lang, Junjun, Zhang, Qiong, Liu, Yukun
Covariate shift arises when covariate distributions differ between source and target populations while the conditional distribution of the response remains invariant, and it underlies problems in missing data and causal inference. We propose a minimum Wasserstein distance estimation framework for inference under covariate shift that avoids explicit modeling of outcome regressions or importance weights. The resulting W-estimator admits a closed-form expression and is numerically equivalent to the classical 1-nearest neighbor estimator, yielding a new optimal transport interpretation of nearest neighbor methods. We establish root-$n$ asymptotic normality and show that the estimator is not asymptotically linear, leading to super-efficiency relative to the semiparametric efficient estimator under covariate shift in certain regimes, and uniformly in missing data problems. Numerical simulations, along with an analysis of a rainfall dataset, underscore the exceptional performance of our W-estimator.
- Asia > Bangladesh (0.04)
- North America > United States > New York (0.04)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- (3 more...)
- South America (0.04)
- Oceania > New Zealand (0.04)
- Oceania > Micronesia (0.04)
- (15 more...)
- Health & Medicine > Therapeutic Area (1.00)
- Media (0.67)
- Education (0.67)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Online Consistency of the Nearest Neighbor Rule
In the realizable online setting, a learner is tasked with making predictions for a stream of instances, where the correct answer is revealed after each prediction. A learning rule is online consistent if its mistake rate eventually vanishes. The nearest neighbor rule (Fix and Hodges, 1951) is a fundamental prediction strategy, but it is only known to be consistent under strong statistical or geometric assumptions--the instances come i.i.d. or the label classes are well-separated. We prove online consistency for all measurable functions in doubling metric spaces under the mild assumption that the instances are generated by a process that is uniformly absolutely continuous with respect to a finite, upper doubling measure.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > California > San Diego County > La Jolla (0.04)
- North America > United States > North Dakota > McKenzie County (0.04)
- North America > United States > North Dakota > McKenzie County (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > Canada (0.04)
- (2 more...)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- North America > United States > Texas (0.04)
- North America > United States > North Dakota > McKenzie County (0.04)
- (2 more...)
- Europe > France > Île-de-France > Paris > Paris (0.04)
- North America > United States > Texas (0.04)
- North America > United States > North Dakota > McKenzie County (0.04)
- Europe > Switzerland > Vaud > Lausanne (0.04)
Optimal Binary Classification Beyond Accuracy
The vast majority of statistical theory on binary classification characterizes performance in terms of accuracy. However, accuracy is known in many cases to poorly reflect the practical consequences of classification error, most famously in imbalanced binary classification, where data are dominated by samples from one of two classes.
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
- North America > United States > Texas (0.04)
- North America > United States > North Dakota > McKenzie County (0.04)
- (3 more...)
Unsupervised Document and Template Clustering using Multimodal Embeddings
Sampaio, Phillipe R., Maxcici, Helene
We study unsupervised clustering of documents at both the category and template levels using frozen multimodal encoders and classical clustering algorithms. We systematize a model-agnostic pipeline that (i) projects heterogeneous last-layer states from text-layout-vision encoders into token-type-aware document vectors and (ii) performs clustering with centroid- or density-based methods, including an HDBSCAN + $k$-NN assignment to eliminate unlabeled points. We evaluate eight encoders (text-only, layout-aware, vision-only, and vision-language) with $k$-Means, DBSCAN, HDBSCAN + $k$-NN, and BIRCH on five corpora spanning clean synthetic invoices, their heavily degraded print-and-scan counterparts, scanned receipts, and real identity and certificate documents. The study reveals modality-specific failure modes and a robustness-accuracy trade-off, with vision features nearly solving template discovery on clean pages while text dominates under covariate shift, and fused encoders offering the best balance. We detail a reproducible, oracle-free tuning protocol and the curated evaluation settings to guide future work on unsupervised document organization.
- South America > Brazil > Rio Grande do Sul > Porto Alegre (0.04)
- North America > United States > North Dakota > McKenzie County (0.04)
- Europe > Switzerland (0.04)
- (3 more...)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > California > San Diego County > La Jolla (0.04)
- (2 more...)
- North America > United States > Texas (0.04)
- North America > United States > North Dakota > McKenzie County (0.04)